Abstract: Motivated by the type of missions currently performed
by unmanned aerial vehicles, we investigate a discrete
dynamic vehicle routing problem with a potentially large
number of targets and vehicles. Each target is modeled as an
independent two-state Markov chain, whose state is not observed
if  the  target  is  not  visited  by  some  vehicle.  The  goal  for  the 
vehicles  is  to  collect  rewards  obtained  when  they  visit  the  targets 
in  a  particular  state.  This  problem  can  be  seen  as  a  type  of 
restless  bandits  problem  with  partial  information.  We  compute 
an  upper  bound  on  the  achievable  performance  and  obtain  in 
closed  form  an  index  policy  proposed  by  Whittle.  Simulation 
results  provide  evidence  for  the  outstanding  performance  of  this 
index  heuristic  and  for  the  quality  of  the  upper  bound. 

I.  INTRODUCTION 

Unmanned  aerial  vehicles  (UAVs)  are  actively  used  for 
military  operations  and  considered  for  civilian  applications 
such  as  environmental  monitoring.  Technological  advances  in 
this  area  have  been  impressive,  and  it  seems  now  that  a  major 
challenge  for  future  developments  will  be  to  increase  the 
degree of automation of these systems [1]. For this we need
solutions with acceptable levels of performance to difficult
optimization problems, such as variants of the weapon-target
assignment problem [2]. Often, the problems solved are
static combinatorial optimization problems resulting in open-loop
policies. Yet, for most applications of UAVs, involving
surveillance  and  monitoring,  we  would  like  to  factor  into 
the  decision  making  process  the  (stochastic)  evolution  of  the 
environment,  which  results  in  even  harder  stochastic  control 
problems. 

In this paper, we consider the following scenario. A group
of $M$ mobile sensors (also called agents in the following)
is tracking the states of $N > M$ sites. We discretize time. At
each period, each site can be in one of two states $\{s_1, s_2\}$,
but we only know the state of a site with certainty if we
actually visit it with a sensor. For $i \in \{1, \dots, N\}$, the state
of site $i$ changes from one period to the next according to a
Markov chain with known transition probability matrix $P^i$,
independently of whether a sensor is present or not, and
independently of the other sites. To specify $P^i$, it is sufficient
to give $P^i_{11}$ and $P^i_{21}$, which are the probabilities of transition
to state $s_1$ from state $s_1$ and $s_2$ respectively. When a sensor
explores site $i$, it can observe its state without measurement
error, and obtains a reward $R^i$ if the site is in state $s_1$. There
is no cost for moving the agents between the sites. We want
to determine how we should allocate the agents at each time
period, in order to maximize an expected total discounted
reward over an infinite horizon.

This  problem  is  related  to  various  sensor  management 
problems.  These  problems  have  a  long  history  [3],  [4],  but 
have  enjoyed  a  renewed  interest  more  recently.  Close  to  the 
ideas  of  this  work,  we  mention  the  use  by  Krishnamurthy  and 
Evans  [5],  [6]  of  Gittins’  solution  to  the  multi-armed  bandit 
problem  to  direct  a  radar  beam  towards  multiple  moving 
targets. La Scala and Moran [7] suggest using the restless
bandits model instead for a similar problem, as we do here.
However,  in  the  restricted  symmetric  cases  that  [7]  considers, 
the  greedy  solution  is  optimal  and  Whittle’s  indices  and 
upper  bound  are  not  computed.  Whittle  already  mentioned 
the  potential  application  of  restless  bandits  to  airborne  sensor 
routing  in  his  original  paper  [8].  Recently,  a  slightly  more 
general  version  of  our  problem  was  considered  independently 
by  Guha  et  al.  [9],  in  the  average-cost  setting,  to  schedule 
transmissions on wireless communication channels in different
states. These authors propose a policy that is different
from  Whittle’s  and  offers  a  performance  guarantee  of  2. 

Let  us  start  by  briefly  recalling  the  multi-armed  bandit 
problem  (MABP)  and  restless  bandits  problem  (RBP).  The 
classical MABP concerns $N$ sites or projects, where the state
of project $i$ at discrete time $t$ is $x_t^i$. At each time $t$, only one
project can be worked on. Then a reward $r^i(x_t^i)$ is received,
and the state $x_t^i$ evolves to $x_{t+1}^i$ according to a known Markov
rule specific to project $i$. The $N - 1$ projects that are not
operated produce no reward and their states do not change.
The important result of Gittins [10], [11] is that the rich
structure of this problem makes possible an efficient solution.
Optimal policies turn out to have the form of an index rule.
That is, we can compute independently for each project an
index $\lambda^i(x_t^i) \in \mathbb{R}$ such that the optimal policy is to operate
at each period the project with the maximal index.

The assumptions made in the MABP limit its applicability
to the sensor management problem. Suppose one has to
track the state of $N$ targets evolving independently. First, the
MABP solution helps schedule only one sensor, since only
one target can be worked on at each period. Moreover, even
if  one  does  not  make  new  measurements  on  a  specific  target, 
its  information  state  still  has  to  be  updated  using  the  known 
dynamics  of  the  true  state.  This  violates  the  assumption  that 
the  projects  that  are  not  operated  remain  frozen.  To  use 
Gittins’  result  for  this  problem,  [5]  must  assume  that  the 
dynamics  of  the  targets  are  slow  and  that  the  propagation 
step  of  the  filters  can  be  neglected  for  unobserved  targets. 


To overcome the shortcomings of the MABP, Whittle
introduced the RBP [8]. In this problem, we now allow $M$
projects to be operated simultaneously, rewards can be generated
for the projects that are not active, and most importantly
these projects are also allowed to evolve, possibly according
to different transition rules. These less stringent assumptions
are very useful for the sensor management problem, but
unfortunately the RBP is now known to be intractable,
in fact PSPACE-hard [12], even if $M = 1$ and we only
allow deterministic transition rules. Nonetheless, Whittle
investigated  an  interesting  relaxation  and  index  policy  for 
this  problem,  which  extends  Gittins’  and  which  we  will 
review  in  section  IV  in  our  specific  context.  The  relaxation 
technique  has  been  used  apparently  independently  for  sensor 
management  problems  by  Castanon  [13],  [14],  who  does  not 
develop  index  policies  however.  [15]  also  investigates  the 
relaxation  technique  in  a  more  general  setting. 

The rest of the paper is organized as follows. In section
II, we give a precise formulation of our problem. In section
III, we provide a counterexample showing that the obvious
candidate  greedy  solution  to  the  problem  is  not  optimal. 
Section  IV  gives  a  general  discussion  of  our  proposed 
solution  to  this  sensor  routing  problem.  Whittle’s  method 
is  discussed  with  an  emphasis  on  computations  and  from  the 
point  of  view  of  constrained  Markov  decision  processes  [16]. 
An  upper  bound  on  the  achievable  performance  is  obtained 
by  solving  a  relaxed  problem  using  a  Lagrangian  approach 
and  subgradient  optimization.  A  lower  bound  is  obtained 
by  computing  Whittle’s  index  policy.  The  computation  of 
Whittle’s  indices  is  non  trivial  in  general,  and  the  indices 
may  not  always  exist.  However,  in  section  V  we  show 
the  indexability  of  our  particular  problem  by  obtaining  a 
closed  form  expression  of  Whittle’s  indices,  which  is  the 
main  result  of  the  paper.  We  also  obtain  in  closed  form 
the  subgradient  necessary  for  the  computation  of  the  upper 
bound  on  achievable  performance.  Finally  in  section  VI, 
we  verify  experimentally  the  high  performance  of  the  index 
policy  by  comparing  it  to  the  upper  bound  for  problems 
involving  a  large  number  of  targets  and  vehicles. 

II.  Problem  Formulation 

For the dynamic optimization problem described in the
introduction, the state of the $N$ sites at time $t$ is $x_t =
(x_t^1, \dots, x_t^N) \in \{s_1, s_2\}^N$, and the control is to decide which $M$
sites to observe. An action at time $t$ can only depend on the
information state $I_t$, which consists of the actions $a_0, \dots, a_{t-1}$
at previous times as well as the observations $y_0, \dots, y_{t-1}$ and
the prior information $y_{-1}$ on the initial state $x_0$. We represent
an action $a_t$ by the vector $(a_t^1, \dots, a_t^N) \in \{0,1\}^N$, where $a_t^i = 1$
if site $i$ is visited by a sensor at time $t$, and $a_t^i = 0$ otherwise.

Assume  the  following  flow  of  events.  Given  our  current 
information  state,  we  make  the  decision  as  to  which  M  sites 
to  observe.  The  rewards  are  obtained  depending  on  the  states 
observed,  and  the  information  state  is  updated.  Once  the 
rewards  have  been  collected,  the  states  of  the  sites  evolve 
according  to  the  known  transition  probabilities. 


Let $p$ be a given probability distribution on the initial state
$x_0$. We assume independence of the initial distributions, i.e.,

$$P(x_0^1 = s^1, \dots, x_0^N = s^N) = p(s^1, \dots, s^N)
= \prod_{i=1}^{N} (p^i_{-1})^{1\{s^i = s_1\}} (1 - p^i_{-1})^{1\{s^i = s_2\}},$$

for some given numbers $p^i_{-1} \in [0,1]$. We denote by $1\{\cdot\}$
the indicator function. For an admissible policy $\pi$, i.e.,
depending only on the information process, we denote by $E^\pi_p$ the
expectation operator. We want to maximize over the set $\Pi$ of
admissible policies the expected infinite-horizon discounted
reward (with discount factor $\alpha$)

$$J(p, \pi) = E^\pi_p \left\{ \sum_{t=0}^{\infty} \alpha^t \, r(x_t, a_t) \right\}, \tag{1}$$

where

$$r(x_t, a_t) = \sum_{i=1}^{N} R^i \, 1\{a_t^i = 1, \, x_t^i = s_1\},$$

and subject to the constraint

$$\sum_{i=1}^{N} 1\{a_t^i = 1\} = M, \quad \forall t. \tag{2}$$


It is well known that we can reformulate this problem as
an equivalent Markov decision process (MDP) with complete
information [17]. A sufficient statistic for this problem is
given by the conditional probability $P(x_t \mid I_t)$, so we look for
an optimal policy of the form $\pi_t(P(x_t \mid I_t))$. An additional
simplification in our problem comes from the fact that the
sites are assumed to evolve independently. Let $p_t^i$ be the
probability that site $i$ is in state $s_1$ at time $t$, given $I_t$. A simple
sufficient statistic at time $t$ is then $(p_t^1, \dots, p_t^N) \in [0,1]^N$.

We have the following recursion:

$$p^i_{t+1} =
\begin{cases}
P^i_{11}, & \text{if site } i \text{ is visited at time } t \text{ and found in state } s_1, \\
P^i_{21}, & \text{if site } i \text{ is visited at time } t \text{ and found in state } s_2, \\
f^i(p^i_t) := p^i_t P^i_{11} + (1 - p^i_t) P^i_{21} = P^i_{21} + p^i_t (P^i_{11} - P^i_{21}), & \text{if site } i \text{ is not visited at time } t.
\end{cases} \tag{3}$$
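
The recursion (3) can be sketched in a few lines of Python. This is only an illustration under our own conventions: the function name `update_belief` and the encoding of the observation (`None` for a site that is not visited, `1` or `2` for the observed state) are assumptions made here and do not come from the paper.

```python
def update_belief(p, P11, P21, observed_state=None):
    """One step of recursion (3) for a single site.

    p              -- current probability that the site is in state s1
    P11, P21       -- transition probabilities to s1 from s1 and from s2
    observed_state -- None if the site is not visited, else 1 or 2
    """
    if observed_state == 1:      # visited and found in s1
        return P11
    if observed_state == 2:      # visited and found in s2
        return P21
    # not visited: propagate the belief through the chain, f(p)
    return P21 + p * (P11 - P21)


# Without visits, and when |P11 - P21| < 1, the belief drifts toward the
# fixed point of f, which is P21 / (1 - (P11 - P21)).
belief = 0.9
for _ in range(50):
    belief = update_belief(belief, P11=0.7, P21=0.2)
```

The last lines show why unvisited sites cannot be treated as frozen: the belief keeps moving even when no measurement is made.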


III.  Non-Optimality  of  the  Greedy  Policy 
We  can  first  try  to  solve  the  problem  formulated  above 
with  a  general  purpose  solver  for  partially  observable  MDPs. 
However,  the  computations  become  quickly  intractable,  since 
the  size  of  the  underlying  state  space  increases  exponentially 
with  the  number  of  sites.  Moreover,  this  approach  would  not 
take  advantage  of  the  structure  of  the  problem,  notably  the 
independent  evolution  of  the  sites.  We  would  like  to  use  this 
structure  to  design  optimal  or  good  suboptimal  policies  more 
efficiently. 

Fig. 1. Counter Example.

There is an obvious candidate solution to this problem,
which consists in selecting at each period the $M$ sites for
which $p_t^i R^i$ is the highest. Let us call this policy the “greedy
policy”. It is not optimal in general. To see this, it is sufficient
to consider a simple example with completely deterministic
transition rules but uncertainty on the initial state. This
underlines the importance of exploring at the right time.
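
Such an example can be checked numerically. The sketch below assumes a two-site, one-sensor instance in which site 1 always pays $R^1$, site 2 switches state deterministically at every period, and `P2` is the prior probability that site 2 starts in $s_1$. The numerical values and function names are illustrative choices, and the infinite horizon is truncated to 500 periods.

```python
ALPHA = 0.9          # discount factor (illustrative value)
R1, R2 = 1.0, 3.0    # rewards of sites 1 and 2 (illustrative values)
EPS = 0.01
P2 = (1.0 - EPS) / 3.0   # prior probability that site 2 starts in s1

# The comparison below hard-codes the greedy choices, so check that they apply.
assert P2 * R2 < R1 < (1.0 - P2) * R2


def known_state_values(horizon):
    """Greedy values once site 2 is known: (value if in s2, value if in s1)."""
    v0 = v1 = 0.0
    for _ in range(horizon):
        # s2 now: take R1 at site 1, site 2 will be in s1 next period.
        # s1 now: take R2 at site 2, site 2 will be in s2 next period.
        v0, v1 = R1 + ALPHA * v1, R2 + ALPHA * v0
    return v0, v1


def greedy_value(p2, horizon):
    """Greedy: site 1 first, then site 2 (whose belief has become 1 - p2)."""
    v0, v1 = known_state_values(horizon - 2)
    second = (1 - p2) * R2 + ALPHA * ((1 - p2) * v0 + p2 * v1)
    return R1 + ALPHA * second


def inspect_first_value(p2, horizon):
    """Visit site 2 immediately, then follow the greedy policy."""
    v0, v1 = known_state_values(horizon - 1)
    return p2 * R2 + ALPHA * (p2 * v0 + (1 - p2) * v1)


# Positive gap: inspecting site 2 first beats the greedy policy here.
gap = inspect_first_value(P2, 500) - greedy_value(P2, 500)
```

With these values the greedy policy postpones the visit to site 2 by one period and loses reward by doing so.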

Consider the example shown in Fig. 1, with $N = 2$, $M = 1$.
Assume that we know already at the beginning that site 1
is in state $s_1$, i.e., $p^1_{-1} = 1$. Hence we know that every time
we select site 1, we will receive a reward $R^1$, and in effect
this makes state $s_2$ of site 1 obsolete. Assume $p^2_{-1} R^2 < R^1$
but $(1 - p^2_{-1}) R^2 > R^1$, i.e., $R^2 - R^1 > p^2_{-1} R^2$. Let us write
$p^2$ for $p^2_{-1}$ for simplicity. The greedy policy, with associated
reward-to-go $J_g$, first selects site 1, and we have

$$J_g(1, p^2) = R^1 + \alpha J_g(1, 1 - p^2).$$

During the second period the greedy policy chooses site 2.
Hence

$$J_g(1, 1 - p^2) = (1 - p^2) R^2 + \alpha (1 - p^2) J_g(1, 0) + \alpha p^2 J_g(1, 1).$$

Note that $J_g(1,0)$ and $J_g(1,1)$ are also the optimal values for
the reward-to-go at these states, because the greedy policy
is obviously optimal once all uncertainty has been removed.
It is easy to compute

$$J_g(1,0) = \frac{R^1 + \alpha R^2}{1 - \alpha^2}, \qquad
J_g(1,1) = \frac{R^2 + \alpha R^1}{1 - \alpha^2}.$$

Now suppose we sample first at site 2, removing the
uncertainty, and then follow the greedy policy, which is
optimal. We get for the associated reward-to-go:

$$J(1, p^2) = p^2 R^2 + \alpha p^2 J_g(1,0) + \alpha (1 - p^2) J_g(1,1).$$

We compute the difference and obtain after some calculations:

$$J - J_g = p^2 R^2 - R^1 + \alpha p^2 R^1.$$

For example, we can take $R^2 = 3R^1$, $p^2 = (1 - \varepsilon)/3$, for a
small $\varepsilon > 0$. We get $p^2 R^2 = R^1 (1 - \varepsilon) < R^1$ and $(1 - p^2) R^2 =
(2 + \varepsilon) R^1 > R^1$ so our assumptions are satisfied. Then $J -
J_g = \frac{\alpha}{3} R^1 \left(1 - \varepsilon - \frac{3\varepsilon}{\alpha}\right)$, which can be made positive for $\varepsilon$
small enough, and as large as we want by simply scaling the
rewards. Hence in this case it is better to first inspect site 2
than to follow the greedy policy from the beginning.

IV. Restless Bandits

The optimization problem (1) subject to the resource
constraint (2) seems difficult to solve directly. However one
can obtain an upper bound on the achievable performance
by relaxing the constraint (2) to enforce it only on average.
More specifically, we replace it by the following constraint

$$E^\pi_p \left\{ \sum_{t=0}^{\infty} \alpha^t \sum_{i=1}^{N} 1\{a_t^i = 1\} \right\} = \frac{M}{1 - \alpha},$$

or equivalently by

$$D(p, \pi) := E^\pi_p \left\{ \sum_{t=0}^{\infty} \alpha^t \sum_{i=1}^{N} 1\{a_t^i = 0\} \right\} = \frac{N - M}{1 - \alpha}. \tag{4}$$

Clearly (4) is implied by (2), so solving the optimization
problem (1) with relaxed constraint (4) indeed provides an
upper bound on the achievable performance. This relaxed
problem can now be solved using the tools available for
constrained MDPs. The two main (dual) approaches are a
direct linear programming formulation on the set of occupation
measures, or a Lagrangian approach using dynamic
programming ideas [16]. In addition to solving the relaxed
problem, we would also like to use its solution to obtain a
feasible policy for the original problem. We do this by using
the additional restless bandits structure.

To study the restless bandits problem, Whittle used the
Lagrangian approach for the constrained MDP, which we
also follow here. The following results can be found in [16,
chapter 3]. Define the Lagrangian

$$L(p, \pi, \lambda) = J(p, \pi) + \lambda \left( D(p, \pi) - \frac{N - M}{1 - \alpha} \right),$$

with $\lambda \in \mathbb{R}$ a Lagrange multiplier. Then the optimal reward
for the problem with averaged constraint satisfies

$$J^*(p) = \sup_{\pi \in \Pi} \inf_{\lambda} L(p, \pi, \lambda) = \sup_{\pi \in \Pi_S} \inf_{\lambda} L(p, \pi, \lambda),$$

where $\Pi_S$ is the set of stationary Markov (randomized)
policies. Since we allow for randomized policies, a classical
minimax theorem allows us to interchange the sup and the
inf to get

$$J^*(p) = \inf_{\lambda} \left\{ J^*(p; \lambda) - \lambda \frac{N - M}{1 - \alpha} \right\}, \tag{5}$$

where

$$\begin{aligned}
J^*(p; \lambda) &= \sup_{\pi \in \Pi_D} \left\{ J(p, \pi) + \lambda D(p, \pi) \right\} \qquad (6) \\
&= \sup_{\pi \in \Pi_D} E^\pi_p \left\{ \sum_{t=0}^{\infty} \alpha^t \sum_{i=1}^{N} \left[ R^i \, 1\{a_t^i = 1, \, x_t^i = s_1\} + \lambda \, 1\{a_t^i = 0\} \right] \right\},
\end{aligned}$$

and $\Pi_D$ is now the set of stationary deterministic policies. For
a fixed $\lambda$, $J^*(p; \lambda)$ can be computed using dynamic programming,
and the possibility to restrict to deterministic policies
is a classical result for unconstrained dynamic programming.
Moreover, the computation of $J^*(p; \lambda)$ has the interesting
property of being separable by site. Hence we can solve the
dynamic programming problem for each site separately:

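$$J^*(p; \lambda) = \sum_{i=1}^{N} J^{*,i}(p^i; \lambda),$$

$$J^{*,i}(p^i; \lambda) = \max \left\{ \lambda + \alpha J^{*,i}(f^i p^i; \lambda), \;
p^i R^i + \alpha p^i J^{*,i}(P^i_{11}; \lambda) + \alpha (1 - p^i) J^{*,i}(P^i_{21}; \lambda) \right\},$$

the second equation being Bellman’s equation for site $i$.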
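
As an illustration of how the per-site Bellman equation can be solved numerically for a fixed $\lambda$, here is a minimal value iteration sketch. It is not the method used in the paper, which relies on closed-form expressions. The uniform belief grid, the linear interpolation and the name `site_value_function` are assumptions made for this example.

```python
import numpy as np


def site_value_function(lam, R, P11, P21, alpha, grid_size=201, iters=300):
    """Approximate J^{*,i}(. ; lam) for one site on a uniform belief grid."""
    grid = np.linspace(0.0, 1.0, grid_size)
    f_grid = P21 + grid * (P11 - P21)          # belief after a passive step
    J = np.zeros(grid_size)
    for _ in range(iters):
        J_P11 = np.interp(P11, grid, J)
        J_P21 = np.interp(P21, grid, J)
        # passive: collect the subsidy lam, the belief moves to f(p)
        passive = lam + alpha * np.interp(f_grid, grid, J)
        # active: expected reward p * R, the belief resets to P11 or P21
        active = grid * R + alpha * (grid * J_P11 + (1.0 - grid) * J_P21)
        J = np.maximum(passive, active)
    return grid, J


grid, J = site_value_function(lam=0.5, R=1.0, P11=0.8, P21=0.3, alpha=0.9)
```

The number of iterations needed grows as $\alpha$ approaches 1, since the error of value iteration decreases like $\alpha^k$.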

We  can  now  finish  the  computation  of  the  upper  bound 
using  standard  dual  optimization  methods.  Suppose  that  we 
are  given  a  prior  p  on  the  initial  states  of  the  sites.  The  dual 
function, which we would like to minimize over $\lambda$, is

$$G(p; \lambda) = J^*(p; \lambda) - \lambda \frac{N - M}{1 - \alpha}.$$

$G$ is a convex function of $\lambda$, although in general not
differentiable. We can solve the minimization problem (5) using
the  subgradient  method,  although  an  even  simpler  method 
such  as  a  line  search  would  also  be  possible.  We  have  the 
following  well-known  result,  see  e.g.  [18]: 

Theorem 1: A subgradient of $G(p; \cdot)$ at $\lambda$ is

$$D(p, \pi^\lambda) - \frac{N - M}{1 - \alpha} = \sum_{i=1}^{N} D^i(p^i, \pi^{\lambda,i}) - \frac{N - M}{1 - \alpha}, \tag{7}$$

where $\pi^\lambda$ is an optimal policy for the problem (6) (which
can be decomposed into optimal policies $\pi^{\lambda,i}$ for each site),
and

$$D^i(p^i, \pi^{\lambda,i}) = E^{\pi^{\lambda,i}}_{p^i} \left\{ \sum_{t=0}^{\infty} \alpha^t \, 1\{a_t^i = 0\} \right\}.$$

We will see in section V that an expression for $D(p, \pi^\lambda)$ is
obtained at no additional cost once we have an expression
for $J^*(p; \lambda)$.

So far however, we have only provided a means to
compute an upper bound on the achievable performance.
It remains to find a good policy for the original, path
constrained problem. Whittle proposed an index policy which
generalizes Gittins’ policy for the multi-armed bandit problem
and emerges naturally from the Lagrangian relaxation.
We underline here only the key ideas and refer the reader to
[8] for more details and motivations behind this heuristic.

To compute Whittle’s indices, we consider the bandits (or
targets) individually. Hence we isolate bandit $i$, consider the
computation problem for $J^{*,i}(p^i; \lambda)$ and drop the superscript
identifier $i$ for simplicity. $\lambda$ can be viewed as a “subsidy for
passivity”, which parametrizes a collection of MDPs. Let us
denote by $\mathcal{P}(\lambda) \subset [0,1]$ the set of information states $p$ of
the bandit such that the passive action is optimal, i.e.,

$$\mathcal{P}(\lambda) = \left\{ p \in [0,1] : \lambda + \alpha J^*(fp; \lambda) \ge pR + \alpha p J^*(P_{11}; \lambda) + \alpha (1 - p) J^*(P_{21}; \lambda) \right\}.$$

Definition 2: A bandit is indexable if $\mathcal{P}(\lambda)$ is monotonically
increasing from $\emptyset$ to $[0,1]$ as $\lambda$ increases from $-\infty$ to
$+\infty$, i.e.,

$$\lambda_1 \le \lambda_2 \Rightarrow \mathcal{P}(\lambda_1) \subset \mathcal{P}(\lambda_2).$$

Hence a bandit is indexable if the set of states for which it
is optimal to take the passive action increases with the subsidy
for passivity. This requirement seems very natural. Yet
Whittle provided an example showing that it is not always
satisfied, and typically showing the indexability property for
particular cases of the RB problem is challenging, see e.g.
[19], [20]. However, when this property could be established,
Whittle’s index policy, which we now describe, was found
empirically to perform outstandingly well. [21] also studied
a form of asymptotic optimality for this heuristic.

Definition 3: If a bandit is indexable, its Whittle index is
given, for any $p \in [0,1]$, by

$$\lambda(p) = \inf \left\{ \lambda \in \mathbb{R} : p \in \mathcal{P}(\lambda) \right\}.$$

Hence, if the bandit is in state $p$, $\lambda(p)$ is the value of the subsidy
$\lambda$ which renders the active and passive actions equally
attractive. Then, restoring the superscripts $i$ for the $N$ bandits,
and assuming that each bandit is indexable, we obtain for
state $(p_t^1, \dots, p_t^N)$ a set of indices $\lambda^1(p_t^1), \dots, \lambda^N(p_t^N)$. The
index heuristic applies at each period $t$ the active action to
the $M$ projects with largest indices $\lambda^i(p_t^i)$, and the passive
action to the remaining $N - M$ projects.
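
Once the indices are available, the index heuristic itself reduces to a top-$M$ selection. The sketch below illustrates this step only. The function name `index_policy_action` is ours, and the indices passed in the usage lines are the greedy quantities $p^i R^i$, used purely as placeholders for the Whittle indices.

```python
import numpy as np


def index_policy_action(indices, M):
    """Return a 0/1 action vector that activates the M sites with largest index."""
    indices = np.asarray(indices, dtype=float)
    action = np.zeros(indices.size, dtype=int)
    # sort by decreasing index; a stable sort keeps tie-breaking deterministic
    top = np.argsort(-indices, kind="stable")[:M]
    action[top] = 1
    return action


# Placeholder indices for illustration only: the greedy index p * R.
beliefs = np.array([0.2, 0.9, 0.5, 0.4])
rewards = np.array([5.0, 1.0, 1.0, 3.0])
action = index_policy_action(beliefs * rewards, M=2)
```

By construction the returned action always satisfies the original constraint (2), which is what makes the index heuristic feasible for the problem with $M$ sensors.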


V. Indexability and Computation of Whittle’s Indices

A.  Preliminaries 

In  this  section  we  give  an  overview  of  the  study  of  the 
indexability  property  for  each  site.  Due  to  space  constraints, 
most  of  the  computations  are  not  presented.  The  interested 
reader can find them in our technical report [22]. For the
sensor  management  problem  considered  in  this  paper,  we 
show  that  the  bandits  are  indeed  indexable  and  compute  the 
Whittle  indices  in  closed  form. 

Since  the  discussion  is  concerned  with  a  single  site,  we 
drop  the  superscript  i.  For  reference  we  rewrite  Bellman’s 
equation  of  optimality  for  this  problem.  If  J  is  the  optimal 
value  function,  then 

$$J(p) = \max \left\{ \lambda + \alpha J(fp), \; pR + \alpha p J(P_{11}) + \alpha (1 - p) J(P_{21}) \right\}, \tag{8}$$

where $fp := p P_{11} + (1 - p) P_{21} = P_{21} + p (P_{11} - P_{21})$.

Note that for simplicity, we dropped the $\lambda$ and the $*$ from
the previous notation, i.e., $J(p) := J^*(p; \lambda)$. First we have
Theorem 4: $J$ is a convex function of $p$, continuous on
$[0,1]$.

Proof: It is well known that we can obtain the value
function by value iteration as a uniform limit of cost functions
for finite horizon problems, which are continuous,
piecewise linear and convex, see e.g. [23]. The uniform
convergence follows from the fact that the discounted dynamic
programming operator is a contraction mapping. The
convexity of $J$ follows, and the continuity on the closed
interval $[0,1]$ is a consequence of the uniform convergence. ■

Lemma 5: 1) When $\lambda \le pR$, it is optimal to take the
active action. In particular, if $\lambda \le 0$, it is always optimal
to take the active action and $J$ is affine:

$$J(p) = \alpha J(P_{21}) + p \left[ R + \alpha (J(P_{11}) - J(P_{21})) \right]
= \frac{(\alpha P_{21} + p (1 - \alpha)) R}{(1 - \alpha)(1 - \alpha (P_{11} - P_{21}))}. \tag{9}$$

2) When $\lambda \ge R$, it is always optimal to take the passive
action, and

$$J(p) = \frac{\lambda}{1 - \alpha}. \tag{10}$$

Proof: By convexity of $J$, $J(fp) \le p J(P_{11}) + (1 -
p) J(P_{21})$ and so for $\lambda \le pR$, it is optimal to choose the active
action. The rest of 1 follows by easy calculation, solving first
for $J(P_{11})$ and $J(P_{21})$. To prove 2, use value iteration, starting
from $J_0 = 0$. ■
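
The “easy calculation” behind the closed form in part 1 can be spelled out as follows. This is our reconstruction from Bellman’s equation (8) under the always-active policy; the shorthand $s = P_{11} - P_{21}$ is introduced here for brevity.

```latex
\begin{align*}
J(p) &= pR + \alpha p J(P_{11}) + \alpha (1-p) J(P_{21})
      = \alpha J(P_{21}) + p\,[R + \alpha (J(P_{11}) - J(P_{21}))], \\
J(P_{11}) - J(P_{21}) &= s\,[R + \alpha (J(P_{11}) - J(P_{21}))]
      \;\Longrightarrow\; R + \alpha (J(P_{11}) - J(P_{21})) = \frac{R}{1-\alpha s}, \\
J(P_{21}) &= \alpha J(P_{21}) + \frac{P_{21} R}{1-\alpha s}
      \;\Longrightarrow\; J(P_{21}) = \frac{P_{21} R}{(1-\alpha)(1-\alpha s)}, \\
J(p) &= \frac{\alpha P_{21} R}{(1-\alpha)(1-\alpha s)} + \frac{pR}{1-\alpha s}
      = \frac{(\alpha P_{21} + p(1-\alpha)) R}{(1-\alpha)(1-\alpha s)}.
\end{align*}
```

The second line is obtained by evaluating the first at $p = P_{11}$ and $p = P_{21}$ and subtracting, and the third by evaluating it at $p = P_{21}$.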

With this lemma, it is sufficient to consider from now on
the situation $0 < \lambda < R$.

Lemma 6: The set of $p \in [0,1]$ where it is optimal to
choose the active action is convex, i.e., an interval in $[0,1]$.
Proof: In the set where the active action is optimal, we
have

$$J(p) = pR + \alpha p J(P_{11}) + \alpha (1 - p) J(P_{21}).$$

Consider $p_1$ and $p_2$ in this set. We want to show that for
all $\beta \in [0,1]$, it is also optimal to choose the active action
at $p = \beta p_1 + (1 - \beta) p_2$. We know from Bellman’s equation
(8) that

$$pR + \alpha p J(P_{11}) + \alpha (1 - p) J(P_{21}) \le J(p).$$

By convexity of $J$, we have

$$\begin{aligned}
J(p) &\le \beta J(p_1) + (1 - \beta) J(p_2) \\
&= \beta \left( p_1 R + \alpha p_1 J(P_{11}) + \alpha (1 - p_1) J(P_{21}) \right)
 + (1 - \beta) \left( p_2 R + \alpha p_2 J(P_{11}) + \alpha (1 - p_2) J(P_{21}) \right) \\
&= pR + \alpha p J(P_{11}) + \alpha (1 - p) J(P_{21}).
\end{aligned}$$

Combining the two inequalities, we see that the active action
is optimal at $p$. ■

Lemma 7: The sets of $p \in [0,1]$ where the passive and
active actions are optimal are of the form $[0, p^*]$ and $[p^*, 1]$,
respectively.

Proof: This follows from the convexity of the active
set and the fact that the active action is optimal for $p \ge \lambda / R$
by lemma 5. ■

In the following, we emphasize the dependence of $p^*$ on
$\lambda$ by writing $p^*(\lambda)$. It is a direct consequence of lemma
7 and the continuity of $J$ that $p^*(\lambda)$ is a value where the
passive and the active actions are equally attractive. We also
see that to show the indexability property of definition 2, it is
sufficient to show that $p^*(\lambda)$ is a nondecreasing function of
$\lambda$. Then, Whittle’s index is obtained by inverting the relation
$\lambda \mapsto p^*(\lambda)$, i.e.,

$$\lambda(p) = \inf \left\{ \lambda : p^*(\lambda) = p \right\}.$$
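
Before turning to closed-form expressions, note that this inversion can also be carried out numerically, which gives a way to cross-check an index formula. The sketch below is our own illustration, not the paper's method: it solves Bellman's equation (8) by value iteration on a belief grid and then bisects on $\lambda \in [0, R]$. It relies on indexability, so that the difference between the passive and active values at $p$ changes sign only once. The function names and numerical parameters are assumptions.

```python
import numpy as np


def action_gap(p, lam, R, P11, P21, alpha, grid_size=201, iters=200):
    """Passive minus active value at belief p for subsidy lam (value iteration)."""
    grid = np.linspace(0.0, 1.0, grid_size)
    f_grid = P21 + grid * (P11 - P21)
    J = np.zeros(grid_size)
    for _ in range(iters):
        J_P11, J_P21 = np.interp([P11, P21], grid, J)
        passive = lam + alpha * np.interp(f_grid, grid, J)
        active = grid * R + alpha * (grid * J_P11 + (1.0 - grid) * J_P21)
        J = np.maximum(passive, active)
    J_P11, J_P21, J_fp = np.interp([P11, P21, P21 + p * (P11 - P21)], grid, J)
    return (lam + alpha * J_fp) - (p * R + alpha * (p * J_P11 + (1.0 - p) * J_P21))


def whittle_index_numeric(p, R, P11, P21, alpha, tol=1e-6):
    """Bisection on lam in [0, R] for the subsidy that makes both actions equal."""
    lo, hi = 0.0, R
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if action_gap(p, mid, R, P11, P21, alpha) < 0.0:
            lo = mid      # active still strictly better: raise the subsidy
        else:
            hi = mid
    return 0.5 * (lo + hi)


# When P11 == P21 the passive step carries no information advantage,
# and the index reduces to p * R.
index = whittle_index_numeric(p=0.3, R=2.0, P11=0.4, P21=0.4, alpha=0.9)
```

This brute-force route costs one dynamic program per bisection step, which is why a closed form is valuable when the number of sites is large.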

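An interesting feature of our problem is that it is possible
to compute $p^*(\lambda)$ in closed form. In addition we can
also compute the value function $J(p) := J^*(p; \lambda)$ and the
“discounted passivity measure” for each bandit:

$$D(p, \pi^\lambda) = E^{\pi^\lambda}_p \left\{ \sum_{t=0}^{\infty} \alpha^t \, 1\{a_t = 0\} \right\}.$$

This last quantity is necessary to compute the subgradient
(7). Its computation is a policy evaluation problem. $D(p, \pi^\lambda)$
obeys the equations

$$D(p, \pi^\lambda) =
\begin{cases}
\alpha p D(P_{11}, \pi^\lambda) + \alpha (1 - p) D(P_{21}, \pi^\lambda), & \text{for } p > p^*(\lambda), \\
1 + \alpha D(fp, \pi^\lambda), & \text{for } p \le p^*(\lambda).
\end{cases}$$

These equations can be compared to those verified by
$J^*(p; \lambda)$ once $p^*(\lambda)$ is known:

$$J^*(p; \lambda) =
\begin{cases}
pR + \alpha p J^*(P_{11}; \lambda) + \alpha (1 - p) J^*(P_{21}; \lambda), & \text{for } p > p^*(\lambda), \\
\lambda + \alpha J^*(fp; \lambda), & \text{for } p \le p^*(\lambda).
\end{cases}$$

Hence it is sufficient to have a closed form solution for
$J^*(p; \lambda)$. To compute $D(p, \pi^\lambda)$, we simply formally set $R = 0$
and $\lambda = 1$ in the corresponding expression for $J^*(p; \lambda)$.
For example, starting from expressions (9) and (10), we
recover the (trivial) result that $D(p, \pi^\lambda) = 0$ if $\lambda \le 0$ and
$D(p, \pi^\lambda) = 1/(1 - \alpha)$ if $\lambda \ge R$.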
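
The substitution $R = 0$, $\lambda = 1$ can also be seen at work in a numerical policy evaluation. The following sketch is an assumption-laden illustration rather than the closed-form computation of the paper: it evaluates a threshold policy on a belief grid, using linear interpolation, which is only approximate near the threshold where the evaluated function may jump. The name `evaluate_threshold_policy` and the threshold value in the usage line are hypothetical.

```python
import numpy as np


def evaluate_threshold_policy(p_star, R, lam, P11, P21, alpha,
                              grid_size=201, iters=300):
    """Value of the policy 'active iff p > p_star' on a uniform belief grid.

    The per-period reward is R in state s1 when active and lam when passive,
    so that (R, lam) = (0, 1) returns the discounted passivity measure D.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    f_grid = P21 + grid * (P11 - P21)
    active_set = grid > p_star
    V = np.zeros(grid_size)
    for _ in range(iters):
        V_P11, V_P21 = np.interp([P11, P21], grid, V)
        active = grid * R + alpha * (grid * V_P11 + (1.0 - grid) * V_P21)
        passive = lam + alpha * np.interp(f_grid, grid, V)
        V = np.where(active_set, active, passive)
    return grid, V


# Discounted passivity measure for a hypothetical threshold p_star = 0.5.
grid, D = evaluate_threshold_policy(0.5, R=0.0, lam=1.0, P11=0.8, P21=0.3, alpha=0.9)
```

The same routine returns the value of the threshold policy when called with the actual $R$ and $\lambda$, which mirrors the remark that no additional work is needed for $D$.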

The computation of $p^*(\lambda)$, $J^*(p; \lambda)$ and $D(p, \pi^\lambda)$ for each
bandit can be performed by distinguishing between various
cases depending on the value of the parameters $P_{11}$ and
$P_{21}$. The computations are rather long and we omit them
in this paper. Here we only give the main result, which is
the expression of the Whittle indices.

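Theorem 8: A two-state restless bandit as considered in
this section is indexable. The index $\lambda(p)$ can be computed
as follows. Let $s = P_{11} - P_{21}$ (then $-1 \le s \le 1$), $f^n P_{21} =
P_{21} \frac{1 - s^{n+1}}{1 - s}$, and $I = \frac{P_{21}}{1 - s}$.

1) Case $s = 0$: $\lambda(p) = pR$.

2) Case $s = 1$: $\lambda(p) = \frac{pR}{1 - \alpha (1 - p)}$.

3) Case $0 < s < 1$:

• If $p \ge P_{11}$ or $p \le P_{21}$: $\lambda(p) = pR$.

• If $I \le p < P_{11}$: $\lambda(p) = \frac{pR}{1 - \alpha (P_{11} - p)}$.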

.  If  B21  <  p  <  /:  Let  k  :=  k(p)  =  [ -  2.  Then 
let 

Bk  =  1  -  ak+2,  Ck  =  a-  ak+2, 

(1  -  aPn )Bk  +  ak+2(  1  -  a) (fk+lP2l ) 


-pY 


Ak  — 

We  have 


1  -  as 


Hp)  = 


A-k(p)  (1  P)Ck(p) 

4)  Case  s  =  —  1: 

.  Iff>l/2:AM=1+y^V. 

.  Ifp<l/2:  X(p)  =  JPcrR. 

5) Case $-1 < s < 0$:

• If $p \ge P_{21}$ or $p \le P_{11}$: $\lambda(p) = pR$.

•  HfPu<P<P2V.  HP)=Ta^iR-

•  If  /  <  P  <  fPn:  X(P)  =  1+g(1  P^:;^PnR-
.  If Pu<P<I-.Kp)  =  t^pu)R-

Fig. 2. Monte-Carlo simulation for Whittle’s index policy and the greedy
policy. The upper bound is computed using the subgradient optimization
algorithm. We fixed $\alpha = 0.95$.

VI.  Simulation  Results 

In  this  section,  we  briefly  present  some  simulation  results 
illustrating  the  performance  of  the  index  policy  and  the 
quality  of  the  upper  bound.  We  generate  sites  with  random 
rewards  R'  within  given  bounds  and  random  parameters 
P\i,  Pn\ .  We  progressively  increase  the  size  of  the  problem 
by  adding  new  sites  and  UAVs  to  the  existing  ones.  We 
keep  the  ratio  M/N  constant,  in  this  case  M/N=  1/20. 
When  generating  new  sites,  we  only  ensure  that  \P\  i  /L  i 
is  sufficiently  far  from  0,  which  is  the  case  where  the 
index  policy  departs  significantly  from  the  simple  greedy 
policy.  The  upper  bound  is  computed  for  each  value  of  N 
using  the  subgradient  optimization  algorithm.  The  expected 
performance  of  the  index  policy  and  the  greedy  policy  are 
estimated  via  Monte-Carlo  simulations. 
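
A Monte-Carlo estimate of this kind can be sketched as follows. This is a hedged illustration and not the code used for the experiments: the function names, the truncation of the infinite horizon and the number of runs are assumptions, and the usage lines plug in the greedy index $p^i R^i$ as a stand-in, since any vectorized index function can be passed in.

```python
import numpy as np


def simulate_index_policy(index_fn, R, P11, P21, p0, M, alpha,
                          horizon=200, runs=100, seed=0):
    """Monte-Carlo estimate of the discounted reward of an index policy.

    index_fn(p, R, P11, P21) must return one index per site (vectorized).
    """
    rng = np.random.default_rng(seed)
    N = len(R)
    total = 0.0
    for _ in range(runs):
        in_s1 = rng.random(N) < p0          # hidden true states
        p = p0.copy()                       # beliefs
        for t in range(horizon):
            visited = np.argsort(-index_fn(p, R, P11, P21), kind="stable")[:M]
            total += alpha**t * R[visited][in_s1[visited]].sum()
            # belief update, recursion (3)
            p = P21 + p * (P11 - P21)
            p[visited] = np.where(in_s1[visited], P11[visited], P21[visited])
            # the true states evolve after the rewards are collected
            in_s1 = rng.random(N) < np.where(in_s1, P11, P21)
    return total / runs


def greedy_index(p, R, P11, P21):
    """Greedy index p * R, used here only as a placeholder index function."""
    return p * R


rng = np.random.default_rng(1)
N, M = 20, 1
R = rng.uniform(1.0, 2.0, N)
P11, P21 = rng.uniform(0.0, 1.0, N), rng.uniform(0.0, 1.0, N)
estimate = simulate_index_policy(greedy_index, R, P11, P21,
                                 np.full(N, 0.5), M, alpha=0.95)
```

Dividing the estimate by $M$ gives the reward per agent.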

Fig.  2  shows  the  result  of  simulations  for  up  to  N  =  3000 
sites.  We  plot  the  reward  per  agent,  dividing  the  total  reward 
by $M$, for readability. We can see the consistently stronger
performance of the index policy with respect to the simple
greedy policy, and in fact its near optimality.
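
For small instances, the upper bound can also be approximated without the closed-form subgradient, using the line search over $\lambda$ mentioned in section IV. The sketch below is such an alternative and not the subgradient algorithm used for Fig. 2. It assumes per-site value iteration on a belief grid and a ternary search over $[0, \max_i R^i]$, which exploits the convexity of $G$; all names and numerical parameters are illustrative.

```python
import numpy as np


def site_value(lam, p, R, P11, P21, alpha, grid_size=101, iters=200):
    """J^{*,i}(p; lam) for one site, by value iteration on a belief grid."""
    grid = np.linspace(0.0, 1.0, grid_size)
    f_grid = P21 + grid * (P11 - P21)
    J = np.zeros(grid_size)
    for _ in range(iters):
        J_P11, J_P21 = np.interp([P11, P21], grid, J)
        passive = lam + alpha * np.interp(f_grid, grid, J)
        active = grid * R + alpha * (grid * J_P11 + (1.0 - grid) * J_P21)
        J = np.maximum(passive, active)
    return float(np.interp(p, grid, J))


def dual_function(lam, p0, R, P11, P21, M, alpha):
    """G(p; lam) = sum_i J^{*,i}(p^i; lam) - lam (N - M) / (1 - alpha)."""
    N = len(R)
    total = sum(site_value(lam, p0[i], R[i], P11[i], P21[i], alpha)
                for i in range(N))
    return total - lam * (N - M) / (1.0 - alpha)


def upper_bound(p0, R, P11, P21, M, alpha, steps=40):
    """Ternary search for the minimum of the convex function G over [0, max R]."""
    lo, hi = 0.0, float(max(R))
    for _ in range(steps):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        g1 = dual_function(m1, p0, R, P11, P21, M, alpha)
        g2 = dual_function(m2, p0, R, P11, P21, M, alpha)
        if g1 <= g2:
            hi = m2
        else:
            lo = m1
    return dual_function(0.5 * (lo + hi), p0, R, P11, P21, M, alpha)
```

Restricting the search to $[0, \max_i R^i]$ is justified by lemma 5: below 0 every site is active and $G$ decreases with $\lambda$, above $\max_i R^i$ every site is passive and $G$ increases. The cost is one dynamic program per site and per evaluation of $G$, so this approach does not scale to thousands of sites the way a closed-form subgradient does.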

VII.  CONCLUSION 

We  have  proposed  the  application  of  Whittle’s  work  on 
restless  bandits  in  the  context  of  a  UAV  routing  problem  with 
partial  information.  For  given  problem  parameters,  we  can 
compute  an  upper  bound  on  the  achievable  performance,  and 
experimental  results  show  that  the  performance  of  Whittle’s 
index  policy  is  often  very  close  to  the  upper  bound.  This  is 
in  agreement  with  existing  work  on  restless  bandit  problems 
for  different  applications.  Some  directions  for  future  work 
include a better understanding of the asymptotic performance
of the index policy and the computation of the indices for
more general state spaces.